Loan Data Exploration by Wan Li
This data set contains 113,937 loans with 81 variables on each loan. I select the following 13 variables to explore: ListingKey, LoanStatus, BorrowerRate, ProsperScore, EmploymentStatus, CreditScoreRangeLower, CreditScoreRangeUpper, StatedMonthlyIncome, LoanMonthsSinceOrigination, LoanOriginalAmount, MonthlyLoanPayment, LP_CustomerPayments, Term.
Univariate Plots Section
## ListingKey LoanStatus BorrowerRate ProsperScore
## 1 1021339766868145413AB3B Completed 0.1580 NA
## 2 10273602499503308B223C1 Current 0.0920 7
## 3 0EE9337825851032864889A Completed 0.2750 NA
## 4 0EF5356002482715299901A Current 0.0974 9
## 5 0F023589499656230C5E3E2 Current 0.2085 4
## 6 0F05359734824199381F61D Current 0.1314 10
## EmploymentStatus CreditScoreRangeLower CreditScoreRangeUpper
## 1 Self-employed 640 659
## 2 Employed 680 699
## 3 Not available 480 499
## 4 Employed 800 819
## 5 Employed 680 699
## 6 Employed 740 759
## StatedMonthlyIncome LoanMonthsSinceOrigination LoanOriginalAmount
## 1 3083.333 78 9425
## 2 6125.000 0 10000
## 3 2083.333 86 3001
## 4 2875.000 16 10000
## 5 9583.333 6 15000
## 6 8333.333 3 15000
## MonthlyLoanPayment LP_CustomerPayments Term
## 1 330.43 11396.14 36
## 2 318.93 0.00 36
## 3 123.32 4186.63 36
## 4 321.45 5143.20 36
## 5 563.97 2819.85 36
## 6 342.37 679.34 60
## [1] 113937 13
## 'data.frame': 113937 obs. of 13 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanMonthsSinceOrigination: int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## [1] "Cancelled" "Chargedoff"
## [3] "Completed" "Current"
## [5] "Defaulted" "FinalPaymentInProgress"
## [7] "Past Due (>120 days)" "Past Due (1-15 days)"
## [9] "Past Due (16-30 days)" "Past Due (31-60 days)"
## [11] "Past Due (61-90 days)" "Past Due (91-120 days)"
## [1] "" "Employed" "Full-time" "Not available"
## [5] "Not employed" "Other" "Part-time" "Retired"
## [9] "Self-employed"
I subset the dataset to include only 12 variables with 113937 observation.
## [1] FALSE
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
The levels in LoanStatus is not ordered, and there are too many levels for past due.

##
## Completed FinalPaymentInProgress Current
## 38074 205 56576
## Past Due (1-15 days) Past Due (16-30 days) Past Due (31-60 days)
## 806 265 363
## Past Due (61-90 days) Past Due (91-120 days) Past Due (>120 days)
## 313 304 16
## Defaulted Chargedoff Cancelled
## 5018 11992 5
I ordered the LoanStatus levels from completed, nearly completed, current, to past due, defaulted, chargedoff and cancelled.

##
## Completed Current PastDue Defaulted Chargedoff Cancelled
## 38074 56781 2067 5018 11992 5
I combined the pastdue levels together, and FinalPaymentInProgress to Current.The majority of the observations are current or completed, and the minority are PastDue, Defaulted and Chargedoff. The cancelled loans are very rare.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The highest BorrowRate is 0.4975 and the lowest is 0. The majority of observations have the BorrowRate between 0.05 and 0.35. The interval near 0.32 has an unusual high frequency. I wonder what factors influence the BorrowRate.
## Warning: Removed 29084 rows containing non-finite values (stat_count).

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 4.00 6.00 5.95 8.00 11.00 29084
ProsperScore is discrete quantitative variable, ranging from 1 to 11. The distribution is quite symmetric with most people having medium ProsperScore between 4 and 8.
## [1] FALSE
##
## Employed Full-time Not available Not employed
## 2255 67322 26355 5347 835
## Other Part-time Retired Self-employed
## 3806 1088 795 6134
The levels of EmploymentStatus is not ordered and some levels overlap, like Employed and Full-time.


## Employed Self-employed Retired Not employed Unkown
## 94765 6134 795 835 11408
I ordered the levels and combined the Full-time, Part-time and Employed to one level: Employed. Most of the observations have Employed status, with 6134 people being self-employed. The employment statuses of 11408 people are not clear. A small number of oberservations have Retired or Not employed status.

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
##
## 19
## 113346
The CreditScoreRangeLower equals CreditScoreRangeLower plus 19. I create a new variable CreditScore, which is average of CreditScoreRangeLower and CreditScoreRangeUpper. The CreditScore have discrete values.

## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.5 669.5 689.5 695.1 729.5 889.5 591
## [1] 649.5 689.5 489.5 809.5 749.5 709.5 829.5 769.5 669.5 629.5 729.5
## [12] 529.5 789.5 609.5 589.5 549.5 569.5 509.5 849.5 869.5 NA 469.5
## [23] 9.5 889.5 449.5 429.5 369.5
Most people have CreditScore between 600 and 800.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
There are some outliers in StatedMonthlyIncome. So I trimmed the highest 1% of the data.Distribution of StatedMonthlyIncome is long-tailed, skewed to the right. So I log-transformed the data. The distribution of transformed data is close to normal distribution.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 6.0 21.0 31.9 65.0 100.0
Much more people started loan in the last 10 months than earlier months, which shows economic resurgence. There is alomost no people starting loans 55-64 months ago, which may be because of the economic crisis.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
## [1] 14333
The amount of the original loan ranges from $1000 to $35000. 3 quarters of the loans are below 12000. Most of the loans are close to certain numbers like $4000, $10000, $1500, $5000. For example, there are 14333 loans with the amount of $4000.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2251.5
Most loans have less than $400 MonthlyLoanPayment. The interval near 170 has the highest frequency.The log-transformed distribution apears to be bimodal peaking around 160 and again around 330.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.35 1005.76 2583.83 4183.08 5548.40 40702.39
Distribution of LP_CustomerPayments is long-tailed.Log-transformed LP_CustomerPayments distribution appears close to normal distribution with peak at about 4000.
##
## 12 36 60
## 1614 87778 24545

There are 3 unique values for Term. The majority of observations have term of 36 months and term of 12 months is quite rare. I change the datatype of Term to factor.
Univariate Analysis
Structure of The Dataset
This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. I select the following 12 variables to explore: ListingKey, LoanStatus, BorrowerRate, ProsperScore, EmploymentStatus, CreditScoreRangeLower, CreditScoreRangeUpper, StatedMonthlyIncome, LoanMonthsSinceOrigination, LoanOriginalAmount, MonthlyLoanPayment, LP_CustomerPayments, Term.
ListingKey is a factor variable showing the unique identifier of each observation. The other variables are features of the observation. The variables LoanStatus and EmploymentStatus are unordered factors. The other variables are numerical or integer variables. I created new variable CreditScore.
- LoanStatus: Most observations are Current or Completed.
- EmploymentStatus: Most obeservations has Employed EmploymentStatus.
- BorrowerRate: The majority of the Borrow Rate fall between 0.05 and 0.35.
- ProsperScore: Most people having medium ProsperScore between 4 and 8.
- CreditScore: Most people have CreditScore between 600 and 800.
- StatedMonthlyIncome: The median StatedMonthlyIncome is 4667.
- LoanMonthsSinceOrigination: Much more people started loan in the last 10 months than earlier months. There is a break between 55 and 64.
- LoanOriginalAmount: Ranging from 1000 to 35000. The median is 6500.
- MonthlyLoanPayment: The median MonthlyLoanPayment is 217.7.
- LP_CustomerPayments: The median LP_CustomerPayments is 2583.83.
- Term: Most loans have term of 36 months.
Main Features of Interest
The main features of interest in the dataset are BorrowerRate, ProsperScore and CreditScore. I’d like to explore which feautures are mostly related to BorrowerRate. I think ProsperScore, CreditScore and other variables may be used to decide BorrowRate.
Other Features in the Dataset
I guess BorrowRate is related with ProsperScore and CreditScore. Other features like LoanStatus, EmploymentStatus, CreditScore, StatedMonthlyIncome, LoanOriginalAmount, and MonthlyLoanPayment may also influence BorrowRate.
New Variable from Existing Variables
The CreditScoreRangeLower equals CreditScoreRangeLower plus 19. I create a new variable CreditScore to represent the credit condition, which is average of CreditScoreRangeLower and CreditScoreRangeUpper. The CreditScore have discrete values.
Bivariate Plots Section

From a subset of the data, BorrowerRate is most strongly related with ProsperScore, followed by CreditScore and LoanOriginalAmount. StatedMonthlyIncome, LoanMonthsSinceOrigination, MonthlyLoanPayment and LP_CustomerPayments don’t have strong correlation with BorrowerRate. StatedMonthlyIncome has moderate relationship with LoanOriginalAmount. The boxplots show that LoanStatus and EmploymentStatus may influence the BorrowerRate. I would like to look closer at the scatter plots and box plots involving BorrowerRate and Other variables.

##
## Pearson's product-moment correlation
##
## data: loan_sub$BorrowerRate and loan_sub$ProsperScore
## t = -248.98, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6536072 -0.6458311
## sample estimates:
## cor
## -0.6497361
The mean of BorrowerRate (red line) decreases aas ProsperScore increases. The variance of BorrowerRate is big, but the conditional mean shows a strong linear relationship. The correlation coefficient is -0.65. BorrowerRate has strong relationship with ProsperScore.


##
## Pearson's product-moment correlation
##
## data: loan_sub$BorrowerRate and loan_sub$CreditScore
## t = -175.17, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4661358 -0.4569730
## sample estimates:
## cor
## -0.4615667
I trimmed the bottom and top 1% of the CreditScore data. The mean of BorrowerRate (red line) decreases as CreditScore increases. The variance of BorrowerRate is big. The correlation between these two variables is -0.46, which shows a moderate linear relationship.

The original plot of conditional mean (binwidth=100) has too much noise. I smoothed it with binwidth of 500 and add a smooth layer (blue line).

##
## Pearson's product-moment correlation
##
## data: loan_sub$BorrowerRate and loan_sub$LoanOriginalAmount
## t = -117.58, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3341283 -0.3237719
## sample estimates:
## cor
## -0.3289599
The variance of BorrowerRate is big.As a whole, the conditional mean of BorrowRate decreases as LoanOriginalAmount increases, but there are many fluctuations. The correlation coefficient between these two variables is -0.33, showing a moderate relationship.

## loan_sub$LoanStatus: Completed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1173 0.1744 0.1864 0.2511 0.4975
## --------------------------------------------------------
## loan_sub$LoanStatus: Current
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0577 0.1314 0.1760 0.1838 0.2310 0.3304
## --------------------------------------------------------
## loan_sub$LoanStatus: PastDue
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0599 0.1885 0.2399 0.2344 0.2916 0.3435
## --------------------------------------------------------
## loan_sub$LoanStatus: Defaulted
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1650 0.2296 0.2231 0.2875 0.4975
## --------------------------------------------------------
## loan_sub$LoanStatus: Chargedoff
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1769 0.2400 0.2354 0.2975 0.4500
## --------------------------------------------------------
## loan_sub$LoanStatus: Cancelled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1075 0.1395 0.2000 0.1844 0.2375 0.2375
The means of BorrowerRate for PastDue, Defaulted and Chargedoff loans are higher than Completed and Current loans.

## loan_sub$EmploymentStatus: Employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1314 0.1800 0.1911 0.2498 0.3600
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Self-employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1400 0.1899 0.2023 0.2695 0.3500
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Retired
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0500 0.1202 0.1829 0.1944 0.2625 0.3500
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.1830 0.2599 0.2441 0.3134 0.3500
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Unkown
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1449 0.1989 0.1977 0.2537 0.4975
Not employed people have much higher median BorrowerRate than other groups.

##
## Pearson's product-moment correlation
##
## data: loan_sub$LoanOriginalAmount and loan_sub$StatedMonthlyIncome
## t = 69.353, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1956816 0.2068243
## sample estimates:
## cor
## 0.2012595
The correlation between StatedMonthlyIncome and LoanOriginalAmount is not strong. But the condition mean shows that the mean of LoanOriginalAmount increases as StatedMonthlyIncome increases.

##
## Pearson's product-moment correlation
##
## data: loan_sub$MonthlyLoanPayment and sqrt(loan_sub$StatedMonthlyIncome)
## t = 135.8, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3682467 0.3782419
## sample estimates:
## cor
## 0.3732551
##
## Pearson's product-moment correlation
##
## data: loan_sub$MonthlyLoanPayment and loan_sub$StatedMonthlyIncome
## t = 67.764, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1912423 0.2024055
## sample estimates:
## cor
## 0.1968303
##
## Pearson's product-moment correlation
##
## data: loan_sub$MonthlyLoanPayment and log10(loan_sub$StatedMonthlyIncome + 1)
## t = 78.974, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2223015 0.2333118
## sample estimates:
## cor
## 0.2278139
The conditional mean of MonthlyLoanPayment increases as StateMonthlyIncome increases. Sqrt-transformed StateMonthlyIncome seems to have better linear relationship with the condition mean of MonthlyLoanPayment. The Correlation between square root of StateMonthlyIncome and MonthlyLoanPayment is 0.37, showing a moderate linear relationship.

##
## Pearson's product-moment correlation
##
## data: loan_sub$ProsperScore and loan_sub$CreditScore
## t = 115.87, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3637793 0.3753979
## sample estimates:
## cor
## 0.369603
ProsperScore and CreditScore have moderate positive relationship with correlation coefficient of 0.37.

##
## Pearson's product-moment correlation
##
## data: loan_sub$BorrowerRate and loan_sub$StatedMonthlyIncome
## t = -30.155, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09473938 -0.08321827
## sample estimates:
## cor
## -0.0889818
Conditional mean of BorrowerRate decreases as the StatedMonthlyIncome increases. But the variance is big and correlation is weak.

The term of loan only has a few discrete values, the MonthlyLoanPayment is mainly determined by LoanOriginalAmount, loan term and borrower rate.

## loan_sub$EmploymentStatus: Employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3333 4981 5782 7083 483333
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Self-employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2734 4333 6338 7083 1750003
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Retired
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1575 2617 2987 3750 83333
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Not employed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 196.5 0.0 20833.3
## --------------------------------------------------------
## loan_sub$EmploymentStatus: Unkown
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2333 3500 4346 5250 208333
Employed people have higher income than Self-employed people who have higher income than retired people. Not employed people mostly don’t have any monthly income.

Current loans have the highest StatedMonthlyIncome. PastDue loans have lower StatedMonthlyIncome. Defaulted and Chargedoff loans has lower median StatedMonthlyIncome than PastDue loans. StatedMonthlyIncome of completed loans is in the middle.

Completed loans have the highest median ProsperScore. PastDue and Chargedoff loans have the lowest median Prosperscore.

Retired people have highest ProsperScore and self-employed people have lowest Prosperscore.

## NULL
Although BorrowerRate doesn’t seem to have any linear relationship with LoanMonthsSinceOrigination. BorrowerRate does seem to change with time. We can see that in different periods, the BorrowerRate has different ranges and patterns.

Loans with term of 12 tends to have lower BorrowerRate. Loans with term of 36 have larger range and variance.
Bivariate Analysis
Relationships Between Features of Interest and Other features
The mean of BorrowerRate decreases as ProsperScore increases. The variance of BorrowerRate is big, but the conditional mean shows a strong linear relationship. The correlation coefficient is -0.65. BorrowerRate has strong negative relationship with ProsperScore.
The mean of BorrowerRate decreases as CreditScore increases. The variance of BorrowerRate is big. The correlation between these two variables is -0.46, which shows a moderate linear relationship.
The conditional mean of BorrowerRate decreases as LoanOriginalAmount increases, but there are many fluctuations. The correlation coefficient between these two variables is -0.33, showing a moderate relationship.
The means of BorrowerRate for PastDue, Defaulted and Chargedoff loans are higher than Completed and Current loans.
Not employed people have much higher median BorrowerRate than other groups.
Conditional mean of BorrowerRate decreases as the StatedMonthlyIncome increases. But the variance is big and correlation is weak.
Although BorrowerRate doesn’t seem to have any linear relationship with LoanMonthsSinceOrigination. BorrowerRate does seem to change with time. We can see that in different periods, the BorrowerRate has different ranges and patterns.
Interesting Relationships between the Other Features
The correlation between StatedMonthlyIncome and LoanOriginalAmount is not strong. But the condition mean shows that the mean of LoanOriginalAmount increases as StatedMonthlyIncome increases.
The conditional mean of MonthlyLoanPayment increases as StateMonthlyIncome increases. Sqrt-transformed StateMonthlyIncome seems to have better linear relationship with the condition mean of MonthlyLoanPayment. The Correlation between square root of StateMonthlyIncome and MonthlyLoanPayment is 0.37, showing a moderate positive linear relationship.
ProsperScore and CreditScore have moderate positive relationship with correlation coefficient of 0.37.
The term of loan only has a few discrete values, the MonthlyLoanPayment is mainly determined by LoanOriginalAmount, loan term and borrower rate. MonthlyLoanPayment and LoanOriginalAmount have strong linear relationship.
Employed people have higher income than Self-employed people who have higher income than retired people. Not-employed people mostly don’t have any monthly income.
Current loans have the highest StatedMonthlyIncome. PastDue loans have lower StatedMonthlyIncome. Defaulted and Chargedoff loans has lower median StatedMonthlyIncome than PastDue loans. StatedMonthlyIncome of completed loans is in the middle.
Completed loans have the highest median ProsperScore. PastDue and Chargedoff loans have the lowest median Prosperscore.
Retired people have highest ProsperScore and self-employed people have lowest Prosperscore.
The Strongest Relationship I found
BorrowerRate has strong negative relationship with ProsperScore and moderate negative relationship with CreditScore. I think ProsperScore and Credit Score could be used in a model to predict the BorrwerRate.
Multivariate Plots Section
We look at the other variables against the main BorrowerRate vs ProsperScore relationship. I cut the CreditScore, LoanOriginalAmount and LoanMonthsSinceOrigination to create discete variables creditcut, amountcut and monthcut with only a few levels. I will only explore observations with a ProsperScore.

##
## (0,650] (650,700] (700,750] (750,800] (800,900]
## 11542 28152 32515 8958 3686

## loan_sub$creditcut: (0,650]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 3.00 5.00 4.99 7.00 11.00
## --------------------------------------------------------
## loan_sub$creditcut: (650,700]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 5.358 7.000 11.000
## --------------------------------------------------------
## loan_sub$creditcut: (700,750]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 6.000 6.151 8.000 11.000
## --------------------------------------------------------
## loan_sub$creditcut: (750,800]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 8.000 7.319 9.000 11.000
## --------------------------------------------------------
## loan_sub$creditcut: (800,900]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.00 9.00 8.38 10.00 11.00



If we count for constant ProsperScore value, the pattern shows that loans with higher CreditScore tend to have lower BorrowerRate.This pattern holds across each level of amountcut.In terms of LoanStatus, this pattern is clear for Completed and Current loans, but is not so clear for PastDue, Defaulted and Chargedoff cases. In terms of EmploymentStatus, this pattern holds for Employed and Self-employed levels, but gets unclear for other levels. In terms of Term, the pattern is not clear for 12 months level.

From this plot we can see that the pattern changed a lot between (-1,13] and (18,38], which means the relationship between BorrowerRate and ProsperScore is different in different periods of time.

If we count for constant ProsperScore value, the pattern shows that Current loans tend to have lower BorrowerRate than PastDue loans and Completed loans. Defaulted and Chargedoff loans have similar, highest BorrowerRate.

If we count for constant ProsperScore value, the pattern shows that unemployed people tend to have highest BorrowerRate, Retired people have the second highest BorrowerRate, employed people in third place and self-employed have lowest BorrowerRate.

Loans with smaller original amount tend to have higher BorrowerRate.

The conditional mean of BorrowerRate with given ProsperScore is related with different periods of time. There are roughly 3 periods: 0-13, 13-38, 38-57. We can see the relationship between BorrowerRate and ProsperScore is more clear in loans within the last 13 month.
I will use these variables to build a linear model to predict the BorrowerRate.
##
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loan_sub2)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScore, data = loan_sub2)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut,
## data = loan_sub2)
## m4: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut +
## LoanStatus, data = loan_sub2)
## m5: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut +
## LoanStatus + EmploymentStatus, data = loan_sub2)
## m6: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut +
## LoanStatus + EmploymentStatus + monthcut, data = loan_sub2)
## m7: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + amountcut +
## LoanStatus + EmploymentStatus + monthcut + Term, data = loan_sub2)
##
## =================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 0.317*** 0.645*** 0.604*** 0.657*** 0.656*** 0.627*** 0.580***
## (0.001) (0.003) (0.003) (0.003) (0.003) (0.002) (0.002)
## ProsperScore -0.020*** -0.017*** -0.015*** -0.016*** -0.016*** -0.018*** -0.018***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## CreditScore -0.000*** -0.000*** -0.000*** -0.000*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## amountcut: (4000,10000]/(0,4000] -0.030*** -0.024*** -0.024*** -0.018*** -0.023***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## amountcut: (10000,15000]/(0,4000] -0.043*** -0.031*** -0.031*** -0.021*** -0.031***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## amountcut: (15000,35000]/(0,4000] -0.049*** -0.032*** -0.033*** -0.016*** -0.026***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## LoanStatus: Completed/Chargedoff -0.026*** -0.025*** -0.021*** -0.017***
## (0.001) (0.001) (0.001) (0.001)
## LoanStatus: Current/Chargedoff -0.057*** -0.057*** -0.020*** -0.021***
## (0.001) (0.001) (0.001) (0.001)
## LoanStatus: Defaulted/Chargedoff 0.002 0.002 0.001 0.001
## (0.002) (0.002) (0.001) (0.001)
## LoanStatus: FinalPaymentInProgress/Chargedoff -0.048*** -0.048*** -0.021*** -0.018***
## (0.003) (0.003) (0.003) (0.003)
## LoanStatus: Past Due (>120 days)/Chargedoff -0.012 -0.011 0.001 -0.003
## (0.012) (0.012) (0.010) (0.010)
## LoanStatus: Past Due (1-15 days)/Chargedoff -0.030*** -0.031*** -0.010*** -0.011***
## (0.002) (0.002) (0.002) (0.002)
## LoanStatus: Past Due (16-30 days)/Chargedoff -0.031*** -0.031*** -0.011*** -0.013***
## (0.003) (0.003) (0.003) (0.002)
## LoanStatus: Past Due (31-60 days)/Chargedoff -0.027*** -0.027*** -0.005* -0.007**
## (0.003) (0.003) (0.002) (0.002)
## LoanStatus: Past Due (61-90 days)/Chargedoff -0.023*** -0.023*** -0.005* -0.007**
## (0.003) (0.003) (0.002) (0.002)
## LoanStatus: Past Due (91-120 days)/Chargedoff -0.021*** -0.021*** -0.005 -0.007**
## (0.003) (0.003) (0.002) (0.002)
## EmploymentStatus: Full-time/Employed -0.004*** -0.025*** -0.024***
## (0.001) (0.001) (0.001)
## EmploymentStatus: Not employed/Employed 0.035*** 0.021*** 0.022***
## (0.002) (0.002) (0.002)
## EmploymentStatus: Other/Employed -0.001 0.001 0.002**
## (0.001) (0.001) (0.001)
## EmploymentStatus: Part-time/Employed -0.007* -0.029*** -0.029***
## (0.003) (0.003) (0.003)
## EmploymentStatus: Retired/Employed 0.001 -0.018*** -0.017***
## (0.002) (0.002) (0.002)
## EmploymentStatus: Self-employed/Employed -0.009*** -0.012*** -0.009***
## (0.001) (0.001) (0.001)
## monthcut: (13,18]/(-1,13] 0.045*** 0.044***
## (0.001) (0.001)
## monthcut: (18,38]/(-1,13] 0.056*** 0.058***
## (0.000) (0.000)
## monthcut: (38,100]/(-1,13] 0.077*** 0.079***
## (0.001) (0.001)
## Term 0.001***
## (0.000)
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.422 0.506 0.559 0.614 0.617 0.694 0.719
## adj. R-squared 0.422 0.506 0.559 0.614 0.617 0.693 0.719
## sigma 0.057 0.052 0.050 0.046 0.046 0.041 0.040
## F 61989.938 43399.745 21501.861 9005.795 6506.329 7998.511 8693.474
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 123078.549 129702.021 134535.681 140222.959 140522.009 149984.946 153706.340
## Deviance 273.094 233.621 208.465 182.312 181.031 144.840 132.676
## AIC -246151.098 -259396.041 -269057.362 -280411.918 -280998.018 -299917.892 -307358.680
## BIC -246123.052 -259358.647 -268991.921 -280252.990 -280782.998 -299674.826 -307106.266
## N 84853 84853 84853 84853 84853 84853 84853
## =================================================================================================================================================================
This linear model can account for 71.9% of the variance in the BorrowerRate.
##
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loan_latest)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScore, data = loan_latest)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination,
## data = loan_latest)
## m4: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination +
## LoanStatus, data = loan_latest)
## m5: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination +
## LoanStatus + EmploymentStatus, data = loan_latest)
## m6: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination +
## LoanStatus + EmploymentStatus + amountcut, data = loan_latest)
## m7: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + LoanMonthsSinceOrigination +
## LoanStatus + EmploymentStatus + amountcut + Term, data = loan_latest)
##
## =================================================================================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 0.279*** 0.596*** 0.616*** 0.628*** 0.623*** 0.612*** 0.574***
## (0.000) (0.003) (0.003) (0.003) (0.003) (0.003) (0.003)
## ProsperScore -0.019*** -0.016*** -0.015*** -0.015*** -0.015*** -0.014*** -0.014***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## CreditScore -0.000*** -0.001*** -0.001*** -0.001*** -0.000*** -0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## LoanMonthsSinceOrigination 0.004*** 0.004*** 0.003*** 0.003*** 0.003***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## LoanStatus: Completed/Chargedoff -0.009*** -0.008*** -0.008*** -0.004*
## (0.002) (0.002) (0.002) (0.002)
## LoanStatus: Current/Chargedoff -0.013*** -0.012*** -0.010*** -0.008***
## (0.002) (0.002) (0.002) (0.002)
## LoanStatus: Defaulted/Chargedoff 0.015* 0.013 0.008 0.013*
## (0.007) (0.007) (0.007) (0.006)
## LoanStatus: FinalPaymentInProgress/Chargedoff -0.011** -0.011** -0.012** -0.005
## (0.004) (0.004) (0.004) (0.003)
## LoanStatus: Past Due (>120 days)/Chargedoff -0.005 -0.005 0.000 -0.006
## (0.015) (0.015) (0.014) (0.013)
## LoanStatus: Past Due (1-15 days)/Chargedoff -0.006* -0.005 -0.004 -0.000
## (0.003) (0.003) (0.003) (0.003)
## LoanStatus: Past Due (16-30 days)/Chargedoff -0.002 -0.003 -0.003 -0.000
## (0.004) (0.004) (0.004) (0.004)
## LoanStatus: Past Due (31-60 days)/Chargedoff 0.001 0.001 0.001 0.004
## (0.003) (0.003) (0.003) (0.003)
## LoanStatus: Past Due (61-90 days)/Chargedoff 0.004 0.006 0.005 0.006
## (0.004) (0.004) (0.004) (0.003)
## LoanStatus: Past Due (91-120 days)/Chargedoff 0.004 0.005 0.005 0.006
## (0.004) (0.004) (0.004) (0.004)
## EmploymentStatus: Full-time/Employed 0.002 -0.001 -0.001
## (0.002) (0.002) (0.002)
## EmploymentStatus: Not employed/Employed 0.084*** 0.072*** 0.078***
## (0.004) (0.004) (0.004)
## EmploymentStatus: Other/Employed 0.002** -0.002** -0.001
## (0.001) (0.001) (0.001)
## EmploymentStatus: Part-time/Employed -0.009 -0.012 -0.019
## (0.019) (0.018) (0.017)
## EmploymentStatus: Retired/Employed 0.005 0.001 0.002
## (0.006) (0.006) (0.005)
## EmploymentStatus: Self-employed/Employed -0.015*** -0.014*** -0.011***
## (0.001) (0.001) (0.001)
## amountcut: (4000,10000]/(0,4000] -0.020*** -0.025***
## (0.000) (0.000)
## amountcut: (10000,15000]/(0,4000] -0.021*** -0.029***
## (0.000) (0.000)
## amountcut: (15000,35000]/(0,4000] -0.016*** -0.024***
## (0.001) (0.001)
## Term 0.001***
## (0.000)
## -----------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared 0.568 0.664 0.704 0.705 0.711 0.729 0.762
## adj. R-squared 0.568 0.664 0.704 0.705 0.711 0.729 0.762
## sigma 0.039 0.035 0.033 0.033 0.032 0.031 0.029
## F 58351.104 43873.969 35204.370 8164.812 5757.224 5425.511 6167.393
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 80530.088 86117.478 88933.515 89015.274 89490.969 90887.609 93745.795
## Deviance 69.214 53.818 47.408 47.234 46.233 43.415 38.172
## AIC -161054.175 -172226.956 -177857.030 -178000.548 -178939.938 -181727.218 -187441.589
## BIC -161028.071 -172192.150 -177813.524 -177870.028 -178757.210 -181518.386 -187224.056
## N 44415 44415 44415 44415 44415 44415 44415
## =================================================================================================================================================================
For loans in the last 12 months, the linear model with the same variables will account for 76.2% of the variance.
Multivariate Analysis
Relationships Observed in Investigation
Holding ProsperScore constant, we can see that CreditScore, LoanOriginalAmount have negtive relationship with BorrowerRate, and LoanStatus, LoanMonthsSinceOrigination and Term all have some influence on BorrwerRate and the relationship between BorrwerRate and ProsperScore.
Interesting Interactions Between Features
From the plots, we can see that in different periods of time, the ProsperScore has different relationships with BorrwerRate. For loans in the last 12 months, the relationship between BorrowerRate and ProsperScore is more clear and linear. Also, the relationship between BorrowerRate and CreditScore is more clear in the last 12/13 months.
For loans of different amounts, the relationships between BorrowerRate and ProsperScore are different.
Linear Model
I created a linear model to predict the BorrowerRate. This linear model account for 71.9% of the total variance of BorrowerRate. ProsperScore account for 42.2% of the variance. The variables CreditScore, amountcut (LoanMonthsSinceOrigination), LoanStatus, monthcut (LoanMonthsSinceOrigination), Term each improve the R^2 value by a few percent. Surprisingly, EmploymentStatus didn’t improve the model very much.
27.8% of the variance can’t be explained. One reason may be that some relationships between the BorrowerRate and other variables are not linear. Another reason is that the model is different in different periods of time. Another reason may be that there are factors that are not considered here.
According to the plot, the model to predict the BorrowerRate may be quite different in different periods of time. If I build the model just for loans started in the last 12 months. The model accounts for 76.2% of the variance, and ProsperScore alone account for 56.8% of the variance, which is quite different from the model from the model for all loans with ProsperScore. This means that BorrowerRate is more closely related to ProsperScore in the last 12 months.
Final Plots and Summary
Plot One

Description One
The majority of observations have the BorrowRate between 0.05 and 0.35. The interval near 0.32 has an unusual high frequency. Except for 0.32, the distribution peaks around 0.15.
Plot Two

Description Two
The mean of BorrowerRate (red line) decreases aas ProsperScore increases. The variance of BorrowerRate is big, but the conditional mean shows a strong linear relationship.
Plot Three

Description Three
If we count for constant ProsperScore value, the pattern shows that loans with higher CreditScore tend to have lower BorrowerRate across each level of LoanOriginalAmount. But the specific pattern between BorrowerRate, ProsperScore and CreditScore is different for different levels of LoanOriginalAmount. Loans with smaller original amount tend to have higher BorrowerRate and bigger variance of BorrowerRate.
Reflection
This data set contains 113,937 loans with 81 variables on each loan. I select 13 variables to explore. I first explored the distribution of each variable, and then I explored the relationships between 2 variables. Finally, I explore the BorrowerRate across many variables and created a linear model.
The BorrowerRate is strongly negatively related with ProsperScore. Holding ProsperScore constant, CreditScore, LoanOriginalAmount have negtive relationship with BorrowerRate, and LoanStatus, LoanMonthsSinceOrigination and Term all have some influence on BorrwerRate and the relationship between BorrwerRate and ProsperScore. Most of the variables I explored have some level of relationship with the BorrowerRate and helped improved the model. But EmploymentStatus doesn’t seem to help much.
I built a linear model to predit the BorrowerRate. I tried to add many variables to linear model to improve the model. But the R^2 value of the linear model is still not very high. I think for future work, maybe the model should be more than a linear model, and different models should be made for different periods of time. We should also explore more variables from the dataset in the future.